
README file for Race, Voice, and Authority in Discussion Groups


The replication archive for this working paper includes the data and code needed to reproduce the analyses in
Race, Voice, and Authority in Discussion Groups. Below, we describe the steps of the analysis and the files
that each step takes and produces.

The archive contains the following folders:
- Code (contains code files)
- Data (contains data files)
- DOCX Documents, ICR Transcripts, and Jury Transcripts (contain raw transcripts)


The analysis follows these steps:

0. Anonymize transcripts by replacing individuals' names***
- Code files:
	- 00_replace_names.R
- Takes the files:
	- Original versions of all transcripts in DOCX Documents, ICR Transcripts, and Jury Transcripts
	- Original versions of jurorspeech.csv, transcript_coding.xlsx, icr_identified_jurors.csv
	- names_key.csv, which links the original names to replacements
- Produces:
	- Modified versions of each file, in which every occurrence of a unique first or last name is replaced with its randomly assigned alternative
***See the end note to this file for further explanation of the anonymizing process.

1. Process Word documents into .csv files of deliberator/group speech
- Code files:
	- 01_attach_text.R
- Takes the files:
	- /DOCX Documents/..., a series of transcripts of deliberations with as many individuals as possible
	identified with a unique name
	- jurorspeech.csv, a document linking unique names in transcripts to individual IDs in survey data
	- JuryDataSummer2004.csv, survey/admin data on each participant and their group
- Produces:
	- jurorspeechtrial.csv, a file with the text attributed to each juror
	- transcriptdetails.csv, a file with information about the coding of each transcript
	- /DOCX Documents/csv_speakingturns/..., csv files for each transcript in which 1 row = 1 speaking turn

2. Create individual-level and group-level datasets from full deliberations of each group
- Code files: 
	- 02_create_fulljury_data.R, which merges survey and transcript data, identifies preference mentions, 
	and assigns jury- and juror-level attributes 
	- word_to_number.R, a set of functions which transform verbal number mentions into readable numeric values 
- Takes the files:
	- /Jury Transcripts/..., a series of transcripts without individual identifiers
	- jurorspeech.csv, a document linking unique names in transcripts to individual IDs in survey data
	- transcript_coding.xlsx, a document recording attempts to identify jurors by transcript
	- jurorspeechtrial.csv, a file with the text attributed to each juror (produced by 01_attach_text.R)
- Produces:
	- jury_level.csv, which contains information about juries' transcripts and linkage attempts
	- jury_with_numbers.RData, which contains the preferences juries mention, cleaned
	- juror_level.RData, which contains information about jurors, plus their juries' mentions of their preferences

3. Create individual-level datasets about individuals' speech
- Code files:
	- 03_create_juror_data.R, which merges survey/admin and individual speech data
- Takes the files:
	- juror_level.RData, juror-level data on jury speech (produced by 02_create_fulljury_data.R)
	- jurorspeechtrial.csv, a file with the text attributed to each juror (produced by 01_attach_text.R)
	- jury_level.csv, jury-level data about transcripts (produced by 02_create_fulljury_data.R)
	- transcriptdetails.csv, information about the coding of each transcript (produced by 01_attach_text.R)
	- jury_with_numbers.RData, jury-level transcript data (produced by 02_create_fulljury_data.R)
	- /DOCX Documents/csv_speakingturns/..., csv files for each transcript (produced by 01_attach_text.R)
- Produces:
	- juror_mentions_byround.RData, juror-level transcript data
 
4. Produce cleaned juror- and jury-level datasets for final analysis
- Code files:
	- 04_create_finaldata.R
- Takes the files:
	- juror_level.RData, juror-level data on jury speech (produced by 02_create_fulljury_data.R)
	- juror_mentions_byround.RData, juror-level transcript data (produced by 03_create_juror_data.R)
	- JuryDataSummer2004.csv, survey/admin data on each participant and their group
- Produces:
	- tab1.csv, juror-level data about individual speech for final analysis
	- tab2.csv, juror-level data about group speech for final analysis

5. Analyze speech, timing, and preference mentions
- Code files:
	- analyses.Rmd, which produces the tables and figures in the main text, except those produced by content.R (step 6)
	- sdi_do.do, which calculates additional statistics to produce Figures 2 and 3 in the main text
- Takes the files:
	- tab1.csv (produced by 04_create_finaldata.R)
	- tab2.csv (produced by 04_create_finaldata.R)


6. Analyze speech content 
- Code files:
	- content.R, which produces the tables and figures related to text analysis
- Takes the files: 
	- jury_with_numbers.RData, jury-level transcript data (produced by 02_create_fulljury_data.R)
	- juror_level.RData, juror-level data on jury speech (produced by 02_create_fulljury_data.R)
	- jurorspeechtrial.csv, a file with the text attributed to each juror (produced by 01_attach_text.R)
- Produces:
	- word_differences.csv, a file of differences in word frequencies

7. Summarize intercoder reliability
- Code files:
	- 07_icr_checks.R
- Takes the files:
	- /ICR Transcripts/..., a series of transcripts coded by multiple coders
	- icr_identified_jurors.csv, which contains multiple coders' identifications of jurors across several transcripts
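
The R steps above could be run in order with a short driver script. The following is a minimal sketch, not part of
the archive: run_steps() is a hypothetical helper, and it assumes that the working directory is the archive root,
that the R scripts sit directly in the Code folder, and that each script can be sourced without arguments. Step 0
is left out because it needs the original transcripts and key file, which are available only from the authors;
analyses.Rmd and sdi_do.do are left out because they are not plain R scripts.

```r
# Sketch only: folder layout and sourcing behavior are assumptions, not
# documented features of the archive.
steps <- c(
  "01_attach_text.R",
  "02_create_fulljury_data.R",
  "03_create_juror_data.R",
  "04_create_finaldata.R",
  "content.R",
  "07_icr_checks.R"
)

# run_steps() is a hypothetical helper. It refuses to start if any script is
# missing, so a partial run does not leave stale intermediate files behind.
run_steps <- function(code_dir, steps) {
  paths <- file.path(code_dir, steps)
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    message("Missing scripts: ", paste(missing, collapse = ", "))
    return(invisible(FALSE))
  }
  for (p in paths) {
    message("Running ", p)
    # Each script gets a fresh environment so objects do not leak between steps.
    source(p, local = new.env())
  }
  invisible(TRUE)
}

run_steps("Code", steps)
```

analyses.Rmd would be rendered separately, for example with rmarkdown::render() if the rmarkdown package is
installed, and sdi_do.do would be run in Stata.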


The archive also includes the following document:
- Codebook.docx, which describes the variables in tab1.csv and tab2.csv used for analysis



*** Note on Anonymized Transcripts
Many of the original transcripts contained the first and/or last names of study participants. The versions of the transcripts
in this archive were altered to preserve the anonymity of the participants, while still allowing readers and code processes to 
consistently identify individuals throughout transcripts: for each name in the transcripts, we randomly selected a unique
replacement name to substitute throughout the text. For example, every time the name "Jane" appears in a transcript,
it is replaced with "Mary."

To accomplish this, we used the udpipe package in R to identify potential proper nouns. We manually refined this list to include
first and last names, but exclude non-proper nouns, place names, and corporate names. We then identified each name as a
primarily feminine first name, a primarily masculine first name, or a last name. We used the lexicon package's lists of common
first and last names to draw replacements within each category, resulting in a key file that contains a replacement for each original name.
Finally, we read in all the transcripts and data files containing names and used this key file to replace the original names.
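
The final replacement step can be illustrated with a minimal sketch. This is not the code in 00_replace_names.R:
the key below is made up, the column names "original" and "replacement" are assumptions about names_key.csv, and
replace_names() is a hypothetical helper. The udpipe and lexicon steps are not shown.

```r
# Hypothetical key for illustration; the real names_key.csv is not distributed
# and its column names are assumed here.
key <- data.frame(
  original    = c("Jane", "Doe"),
  replacement = c("Mary", "Smith"),
  stringsAsFactors = FALSE
)

# Replace each original name with its paired alternative wherever it appears
# as a whole word. Word boundaries keep "Jane" from matching inside "Janet".
# Matching is case-sensitive. Replacements are applied one after another, so
# this sketch assumes no replacement name is also an original name in the key.
replace_names <- function(text, key) {
  for (i in seq_len(nrow(key))) {
    pattern <- paste0("\\b", key$original[i], "\\b")
    text <- gsub(pattern, key$replacement[i], text, perl = TRUE)
  }
  text
}

replace_names("Jane Doe said Janet agreed with Jane.", key)
# "Mary Smith said Janet agreed with Mary."
```

Because the same key is applied to every transcript and data file, a given participant carries the same replacement
name everywhere, which is what lets later steps link speech to survey records.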

The altered transcripts allow for the reproduction of all substantive patterns observed in the original analysis. Because different 
names can be treated slightly differently by various text processing steps, point estimates calculated using the altered 
transcripts deviate slightly from the estimates in the published paper. 

Please contact the authors for access to the original transcripts and the key file used to replace names. 